1 Overview

1.1 Data Source

Kaggle - Telco Customer Churn

Column Name Description Remark
CustomerID A unique ID that identifies each customer String
Gender The customer's gender Male, Female
Age The customer's current age, in years, at the end of the fiscal quarter Number
Senior Citizen Whether the customer is 65 or older Yes, No
Married (Partner) Whether the customer is married Yes, No
Dependents Whether the customer lives with any dependents Yes, No
Number of Dependents The number of dependents that live with the customer Number
Phone Service Whether the customer subscribes to home phone service with the company Yes, No
Multiple Lines Whether the customer subscribes to multiple telephone lines with the company Yes, No
Internet Service Whether the customer subscribes to Internet service with the company No, DSL, Fiber Optic, Cable
Online Security Whether the customer subscribes to the company's additional online security service Yes, No
Online Backup Whether the customer subscribes to the company's additional online backup service Yes, No
Device Protection Plan Whether the customer subscribes to the company's additional protection plan for their Internet equipment Yes, No
Premium Tech Support Whether the customer subscribes to the company's additional technical support plan with reduced wait times Yes, No
Streaming TV Whether the customer uses their Internet service to stream TV programs from a third-party provider Yes, No
Streaming Movies Whether the customer uses their Internet service to stream movies from a third-party provider Yes, No
Contract The customer's current contract type Month-to-Month, One Year, Two Year
Paperless Billing Whether the customer has chosen paperless billing Yes, No
Payment Method How the customer pays their bill Bank Withdrawal, Credit Card, Mailed Check
Monthly Charge The customer's current total monthly charge for all services from the company Number
Total Charges The customer's total charges, calculated to the end of the quarter specified above Number
Tenure The total number of months the customer has been with the company Number
Churn Yes = the customer left the company this quarter; No = the customer stayed Yes, No

1.2 Analytics Target

  • What characteristics make a customer likely to churn?
  • Who will churn, and with what accuracy?
  • What retention program can we offer?

2 Data Profiling

Read the dataset.

2.1 Introduce the Data

2.1.1 introduce

rows                    7043
columns                   21
discrete_columns          17
continuous_columns         4
all_missing_columns        0
total_missing_values      11
complete_rows           7032
total_observations    147903
memory_usage         1641832
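The summary above comes from DataExplorer's introduce(); a minimal sketch of producing it (the package choice matches the plot_* sections below, but the raw-CSV path is an assumption):

```r
# Sketch: profile the raw data with DataExplorer (path is hypothetical)
library(readr)
library(DataExplorer)

data <- read_csv("../data/raw/telco_churn.csv")  # hypothetical path

introduce(data)    # rows, columns, discrete/continuous counts, missing values
plot_intro(data)   # the same counts, shown as percentages
plot_missing(data) # per-feature missing-value profile
```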

2.1.2 plot_intro

Plot a basic description of the data, including:

  • columns (features): discrete vs. continuous, and all-missing columns

  • rows (customers): complete rows

  • missing observations

2.1.3 plot_missing

Only display the features which have missing data.

2.2 Discrete

2.2.1 plot_bar

Show the discrete features.

## 1 columns ignored with more than 50 categories.
## customerID: 7043 categories

2.3 Continuous

2.3.1 plot_histogram

Show the continuous features.

2.3.2 plot_qq (without “Churn”)

Q-Q plot: compare the sample data (points) against a theoretical distribution (line).

## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_qq()`).
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_qq_line()`).

2.3.3 plot_qq (grouped by “Churn”)

Q-Q plot: compare the sample data (points) against a theoretical distribution (line), grouped by “Churn”.

## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_qq()`).
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_qq_line()`).

2.3.4 plot_boxplot

Box plot: outliers are plotted in red.

## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

2.3.5 plot_prcomp

Principal Component Analysis (PCA).

## 1 features with more than 50 categories ignored!
## customerID: 7032 categories
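plot_prcomp is built on base R's prcomp(); a minimal standalone sketch of the underlying call (toy numbers, not the Telco data):

```r
# PCA on two toy numeric columns; scale. = TRUE standardizes each column first
x <- data.frame(tenure         = c(1, 34, 2, 45, 8),
                MonthlyCharges = c(29.9, 57.0, 53.9, 42.3, 70.7))
pc <- prcomp(x, scale. = TRUE)
summary(pc)  # proportion of variance explained by each component
```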

3 Data Cleaning

3.1 Overview

The data-cleaning steps:

  1. Check that the data type of each variable is correct using the str() function.

  2. Handle missing values:

    2.1. Perform KNN (K-Nearest Neighbors) imputation for the “TotalCharges” variable.

  3. Standardize the data (convert text to a consistent form):

    3.1. Conditionally transform values that start with “N” (e.g. “No phone service”) and replace them with “No”.
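The text-standardization step can be sketched in base R (toy values; the report's KNN imputation of “TotalCharges” is not reproduced here):

```r
# Toy data frame with one "No ... service" value and one missing charge
df <- data.frame(
  MultipleLines = c("No phone service", "Yes", "No internet service"),
  TotalCharges  = c(29.9, NA, 108.2),
  stringsAsFactors = FALSE
)

str(df)  # step 1: confirm each variable's type

# Step 3.1: any value starting with "N" collapses to "No"
df$MultipleLines <- ifelse(startsWith(df$MultipleLines, "N"),
                           "No", df$MultipleLines)
df$MultipleLines  # "No" "Yes" "No"
```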

3.2 Comparison of Original and Cleaned Data

  • Original Data
## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : chr  "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
##  $ gender          : chr  "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr  "Yes" "No" "No" "No" ...
##  $ Dependents      : chr  "No" "No" "No" "No" ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr  "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr  "No phone service" "No" "No" "No phone service" ...
##  $ InternetService : chr  "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr  "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr  "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr  "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr  "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr  "No" "No" "No" "No" ...
##  $ StreamingMovies : chr  "No" "No" "No" "No" ...
##  $ Contract        : chr  "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr  "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr  "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : chr  "No" "No" "Yes" "No" ...
  • Cleaned Data
## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : chr  "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
##  $ gender          : chr  "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr  "Yes" "No" "No" "No" ...
##  $ Dependents      : chr  "No" "No" "No" "No" ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr  "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr  "No" "No" "No" "No" ...
##  $ InternetService : chr  "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr  "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr  "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr  "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr  "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr  "No" "No" "No" "No" ...
##  $ StreamingMovies : chr  "No" "No" "No" "No" ...
##  $ Contract        : chr  "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr  "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr  "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : chr  "No" "No" "Yes" "No" ...

4 Feature Engineering

4.1 Overview

The feature-engineering steps:

  1. Remove specific columns:

    1.1. “customerID” is an identifier column that carries no predictive information; it is kept only to index predictions and is excluded from modeling.

    1.2. “TotalCharges” is removed because of its high correlation with other columns (“MonthlyCharges” × “tenure” ≈ “TotalCharges”), which could cause multicollinearity in modeling.

  2. Encoding Categorical Variables:

    2.1. Label Encoding: Converting categorical data to numbers where the order matters.

    2.2. One-Hot Encoding: Converting categorical data to a binary (0 or 1) format.

  3. Scaling and Normalization:

    3.1. Min-Max Scaling: Scaling data to a fixed range 0 to 1.

  4. Oversample the Minority Class:

    4.1. Random Oversampling: New instances are essentially duplicates of existing minority class instances, leading to an increase in the number of samples belonging to the minority class.

  5. Feature Selection:

    5.1. Random Forest Importance

## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values.  Are you sure you want to do regression?
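Steps 2 and 3 above can be sketched in base R on toy values (column names are illustrative; the report's exact recipe may differ):

```r
contract <- c("Month-to-month", "One year", "Month-to-month", "Two year")

# 2.1 Label encoding: ordered levels -> integers 0, 1, 2
contract_label <- as.integer(factor(contract,
  levels = c("Month-to-month", "One year", "Two year"))) - 1L
contract_label  # 0 1 0 2

# 2.2 One-hot encoding: one 0/1 column per category (no intercept term)
contract_onehot <- model.matrix(~ contract - 1)

# 3.1 Min-max scaling to the fixed range [0, 1]
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
min_max(c(1, 34, 2, 45))
```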

4.2 Comparison of Cleaned Data and Feature-engineered Data

  • Cleaned Data
## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : chr  "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
##  $ gender          : chr  "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr  "Yes" "No" "No" "No" ...
##  $ Dependents      : chr  "No" "No" "No" "No" ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr  "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr  "No" "No" "No" "No" ...
##  $ InternetService : chr  "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr  "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr  "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr  "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr  "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr  "No" "No" "No" "No" ...
##  $ StreamingMovies : chr  "No" "No" "No" "No" ...
##  $ Contract        : chr  "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr  "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr  "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : chr  "No" "No" "Yes" "No" ...
  • Feature-engineered Data
## 'data.frame':    7043 obs. of  25 variables:
##  $ customerID                            : chr  "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
##  $ gender                                : int  0 1 1 1 0 0 1 0 0 1 ...
##  $ SeniorCitizen                         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner                               : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ Dependents                            : int  0 0 0 0 0 0 1 0 0 1 ...
##  $ tenure                                : num  -1.2774 0.0663 -1.2366 0.5142 -1.2366 ...
##  $ PhoneService                          : int  0 1 1 0 1 1 1 0 1 1 ...
##  $ MultipleLines                         : int  0 0 0 0 0 1 1 0 1 0 ...
##  $ InternetService                       : int  1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity                        : int  0 1 1 1 0 0 0 1 0 1 ...
##  $ OnlineBackup                          : int  1 0 1 0 0 0 1 0 0 1 ...
##  $ DeviceProtection                      : int  0 1 0 1 0 1 0 0 1 0 ...
##  $ TechSupport                           : int  0 0 0 1 0 0 0 0 1 0 ...
##  $ StreamingTV                           : int  0 0 0 0 0 1 1 0 1 0 ...
##  $ StreamingMovies                       : int  0 0 0 0 0 1 0 0 1 0 ...
##  $ ContractMonth.to.month                : int  1 0 1 0 1 1 1 1 1 0 ...
##  $ ContractOne.year                      : int  0 1 0 1 0 0 0 0 0 1 ...
##  $ ContractTwo.year                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PaperlessBilling                      : int  1 0 1 0 1 1 1 0 1 0 ...
##  $ PaymentMethodBank.transfer..automatic.: int  0 0 0 1 0 0 0 0 0 1 ...
##  $ PaymentMethodCredit.card..automatic.  : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ PaymentMethodElectronic.check         : int  1 0 0 0 1 1 0 0 1 0 ...
##  $ PaymentMethodMailed.check             : int  0 1 1 0 0 0 0 1 0 0 ...
##  $ MonthlyCharges                        : num  -1.16 -0.26 -0.363 -0.746 0.197 ...
##  $ Churn                                 : int  0 0 1 0 1 1 0 0 1 0 ...

5 Model Training

  1. Read the dataset
data <- read_csv("../data/feature/feature_set.csv")
  2. Separate the data into training and testing sets
n <- nrow(data)
data <- data[sample(n),]  # shuffle the rows randomly
index <- createDataPartition(data$Churn, p = 0.8, list = FALSE)
train_data <- data[index, ] 
test_data <- data[-index, ]
  3. Build xgb.DMatrix objects from the training and testing sets for XGBoost
train_matrix <- xgb.DMatrix(data = as.matrix(train_data[-c(1, which(names(train_data) == "Churn"))]), label = train_data$Churn)

test_matrix <- xgb.DMatrix(data = as.matrix(test_data[-c(1, which(names(test_data) == "Churn"))]), label = test_data$Churn)
  4. k-fold cross-validation
params <- list(
  objective = "binary:logistic",
  eval_metric = "auc",
  eta = 0.1,
  max_depth = 6
)

nrounds <- 100
nfold <- 5
early_stopping_rounds <- 10
verbose <- 1

# using xgb.cv to do k-fold cross validation 
cv_result <- xgb.cv(
  params = params,
  data = train_matrix,
  nrounds = nrounds,
  nfold = nfold,
  early_stopping_rounds = early_stopping_rounds,
  verbose = verbose
)
## [1]  train-auc:0.860743+0.004023 test-auc:0.828446+0.023039 
## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 10 rounds.
## 
## [2]  train-auc:0.865658+0.003397 test-auc:0.832031+0.021830 
## [3]  train-auc:0.869886+0.004021 test-auc:0.835190+0.021159 
## [4]  train-auc:0.871866+0.003650 test-auc:0.835934+0.020832 
## [5]  train-auc:0.874667+0.004006 test-auc:0.837381+0.020441 
## [6]  train-auc:0.876757+0.004224 test-auc:0.837741+0.021015 
## [7]  train-auc:0.878987+0.004664 test-auc:0.838346+0.020821 
## [8]  train-auc:0.880340+0.004764 test-auc:0.838527+0.020790 
## [9]  train-auc:0.881991+0.004253 test-auc:0.838548+0.020479 
## [10] train-auc:0.883548+0.004300 test-auc:0.839295+0.020375 
## [11] train-auc:0.884952+0.003763 test-auc:0.839602+0.020473 
## [12] train-auc:0.886351+0.003854 test-auc:0.840465+0.020549 
## [13] train-auc:0.887973+0.003749 test-auc:0.840873+0.020796 
## [14] train-auc:0.889495+0.003588 test-auc:0.841287+0.021207 
## [15] train-auc:0.891050+0.004001 test-auc:0.841721+0.021061 
## [16] train-auc:0.892332+0.004076 test-auc:0.842097+0.021574 
## [17] train-auc:0.893663+0.004112 test-auc:0.842224+0.021416 
## [18] train-auc:0.895172+0.003788 test-auc:0.841956+0.021221 
## [19] train-auc:0.896353+0.004053 test-auc:0.842118+0.021523 
## [20] train-auc:0.897585+0.003920 test-auc:0.842256+0.021198 
## [21] train-auc:0.898803+0.003980 test-auc:0.842433+0.021233 
## [22] train-auc:0.899815+0.004086 test-auc:0.842617+0.021066 
## [23] train-auc:0.900759+0.004255 test-auc:0.842591+0.021022 
## [24] train-auc:0.901873+0.004339 test-auc:0.842154+0.020760 
## [25] train-auc:0.902729+0.004202 test-auc:0.842282+0.020908 
## [26] train-auc:0.903614+0.004248 test-auc:0.842561+0.020914 
## [27] train-auc:0.904406+0.004190 test-auc:0.842735+0.020948 
## [28] train-auc:0.905227+0.004289 test-auc:0.842752+0.021067 
## [29] train-auc:0.906209+0.004341 test-auc:0.842606+0.021287 
## [30] train-auc:0.907177+0.004456 test-auc:0.842585+0.021178 
## [31] train-auc:0.908163+0.004328 test-auc:0.842415+0.021261 
## [32] train-auc:0.909006+0.004255 test-auc:0.842405+0.021257 
## [33] train-auc:0.909909+0.004513 test-auc:0.842559+0.021393 
## [34] train-auc:0.910788+0.004548 test-auc:0.842498+0.021585 
## [35] train-auc:0.911612+0.004438 test-auc:0.842211+0.021635 
## [36] train-auc:0.912225+0.004474 test-auc:0.842090+0.021642 
## [37] train-auc:0.912929+0.004676 test-auc:0.841941+0.021602 
## [38] train-auc:0.913426+0.004645 test-auc:0.841943+0.021822 
## Stopping. Best iteration:
## [28] train-auc:0.905227+0.004289 test-auc:0.842752+0.021067
print(cv_result)
## ##### xgb.cv 5-folds
##   iter train_auc_mean train_auc_std test_auc_mean test_auc_std
##  <num>          <num>         <num>         <num>        <num>
##      1      0.8607430   0.004023058     0.8284462   0.02303896
##      2      0.8656585   0.003396626     0.8320313   0.02182972
##      3      0.8698862   0.004020688     0.8351896   0.02115889
##      4      0.8718658   0.003650155     0.8359339   0.02083240
##      5      0.8746666   0.004006366     0.8373814   0.02044100
##      6      0.8767569   0.004223795     0.8377409   0.02101495
##      7      0.8789868   0.004663579     0.8383465   0.02082135
##      8      0.8803395   0.004763656     0.8385266   0.02079024
##      9      0.8819908   0.004252530     0.8385476   0.02047856
##     10      0.8835484   0.004299779     0.8392954   0.02037471
##     11      0.8849516   0.003763471     0.8396021   0.02047257
##     12      0.8863507   0.003854375     0.8404646   0.02054875
##     13      0.8879733   0.003748580     0.8408727   0.02079586
##     14      0.8894951   0.003587989     0.8412867   0.02120681
##     15      0.8910499   0.004001213     0.8417206   0.02106110
##     16      0.8923321   0.004076423     0.8420971   0.02157389
##     17      0.8936628   0.004112324     0.8422242   0.02141595
##     18      0.8951718   0.003787858     0.8419558   0.02122072
##     19      0.8963529   0.004053059     0.8421179   0.02152256
##     20      0.8975850   0.003919699     0.8422561   0.02119816
##     21      0.8988031   0.003980489     0.8424331   0.02123267
##     22      0.8998149   0.004086370     0.8426167   0.02106615
##     23      0.9007594   0.004255407     0.8425907   0.02102223
##     24      0.9018729   0.004339329     0.8421536   0.02075988
##     25      0.9027289   0.004201750     0.8422820   0.02090800
##     26      0.9036141   0.004247534     0.8425611   0.02091397
##     27      0.9044057   0.004190053     0.8427351   0.02094780
##     28      0.9052268   0.004289175     0.8427520   0.02106684
##     29      0.9062089   0.004341347     0.8426062   0.02128704
##     30      0.9071769   0.004455786     0.8425846   0.02117822
##     31      0.9081635   0.004327935     0.8424150   0.02126060
##     32      0.9090064   0.004254807     0.8424053   0.02125747
##     33      0.9099088   0.004512705     0.8425587   0.02139308
##     34      0.9107879   0.004547732     0.8424975   0.02158477
##     35      0.9116117   0.004438449     0.8422114   0.02163533
##     36      0.9122248   0.004473660     0.8420899   0.02164200
##     37      0.9129294   0.004675734     0.8419408   0.02160185
##     38      0.9134262   0.004644753     0.8419434   0.02182182
##   iter train_auc_mean train_auc_std test_auc_mean test_auc_std
## Best iteration:
##   iter train_auc_mean train_auc_std test_auc_mean test_auc_std
##  <num>          <num>         <num>         <num>        <num>
##     28      0.9052268   0.004289175      0.842752   0.02106684
# Create DMatrix object
data_matrix <- xgb.DMatrix(data = as.matrix(data[-c(1, which(names(data) == "Churn"))]), label = data$Churn)

# get the best iteration
best_nrounds <- cv_result$best_iteration

model <- xgb.train(
  params = params,
  data = data_matrix,
  nrounds = best_nrounds,
  verbose = TRUE
)
  5. Predict labels and probabilities
train_pred <- predict(model, train_matrix)
train_predicted_label <- ifelse(train_pred > 0.5, 1, 0)

test_pred <- predict(model, test_matrix)
test_predicted_label <- ifelse(test_pred > 0.5, 1, 0)
  6. Compute the ROC of the training and testing data
roc_train <- roc(train_data$Churn, train_pred)
roc_test <- roc(test_data$Churn, test_pred)

auc_train <- auc(roc_train)
auc_test <- auc(roc_test)

print(paste("AUC of training data: ", auc_train))
## [1] "AUC of training data:  0.890639729276803"
print(paste("AUC of testing data: ", auc_test))
## [1] "AUC of testing data:  0.888676918185291"
  7. Null model comparison
# Null model prediction: using the mean of Churn in the training set as the probability
mean_churn <- mean(train_data$Churn)
null_train_pred <- rep(mean_churn, nrow(train_data))
null_test_pred <- rep(mean_churn, nrow(test_data))

# Compute the ROC of null model
roc_null_train <- roc(train_data$Churn, null_train_pred)
roc_null_test <- roc(test_data$Churn, null_test_pred)
auc_null_train <- auc(roc_null_train)
auc_null_test <- auc(roc_null_test)

# Print AUC values for the null model
print(paste("AUC for null model on training data: ", auc_null_train))
## [1] "AUC for null model on training data:  0.5"
print(paste("AUC for null model on testing data: ", auc_null_test))
## [1] "AUC for null model on testing data:  0.5"
# Compare the two AUCs with DeLong's test
roc_test_xgb <- roc(test_data$Churn, test_pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_test_null <- roc(test_data$Churn, null_test_pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# DeLong's test
delong_test <- roc.test(roc_test_xgb, roc_test_null, method="delong")

# print the test result
print(delong_test)
## 
##  DeLong's test for two correlated ROC curves
## 
## data:  roc_test_xgb and roc_test_null
## Z = 43.622, p-value < 2.2e-16
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
##  0.3712132 0.4061406
## sample estimates:
## AUC of roc1 AUC of roc2 
##   0.8886769   0.5000000
  8. Feature importance and ranking
importance_matrix <- xgb.importance(model = model, feature_names = colnames(train_data[-c(1, which(names(train_data) == "Churn"))]))
xgb.plot.importance(importance_matrix)

  9. Save the prediction results
train_results <- data.frame(customerID = train_data$customerID, label = train_predicted_label, probability = train_pred, groundtruth = train_data$Churn)

test_results <- data.frame(customerID = test_data$customerID, label = test_predicted_label, probability = test_pred, groundtruth = test_data$Churn)

write_csv(train_results, "train_predictions.csv")
write_csv(test_results, "test_predictions.csv")

head(test_results, 5)
##   customerID label probability groundtruth
## 1 0530-IJVDB     0  0.10891096           0
## 2 3208-YPIOE     0  0.42432478           1
## 3 3237-AJGEH     1  0.58590657           1
## 4 3211-ILJTT     0  0.35297447           1
## 5 7337-CINUD     0  0.04226052           0
  10. Save the feature importance results
write_csv(importance_matrix, "feature_importance.csv")
head(importance_matrix, 5)
##                          Feature       Gain      Cover  Frequency Importance
##                           <char>      <num>      <num>      <num>      <num>
## 1:        ContractMonth.to.month 0.43888966 0.16064444 0.01791868 0.43888966
## 2:                        tenure 0.18648048 0.20786294 0.23569952 0.18648048
## 3:               InternetService 0.11661886 0.09241762 0.02618884 0.11661886
## 4:                MonthlyCharges 0.10358910 0.18545820 0.30048243 0.10358910
## 5: PaymentMethodElectronic.check 0.02191083 0.06834488 0.05375603 0.02191083
  11. Save the XGBoost model
xgb.save(model, "../model/churn_prediction_model.xgb")
## [1] TRUE
  12. Save the ROC curves
# Assuming roc_train_plot and roc_test_plot are ggplot objects for ROC curves
roc_train_plot <- ggroc(roc_train) + ggtitle("ROC Curve - Training Data")
roc_test_plot <- ggroc(roc_test) + ggtitle("ROC Curve - Testing Data")

# Save ROC curves
ggsave("roc_train.png", plot = roc_train_plot)
## Saving 7 x 5 in image
ggsave("roc_test.png", plot = roc_test_plot)
## Saving 7 x 5 in image
print(roc_train_plot)

print(roc_test_plot)


6 Lift Analysis Introduction

Lift analysis is an evaluation technique commonly used in data science and machine learning, especially in marketing and recommender systems. Its main purpose is to measure how much a strategy, campaign, or model improves outcomes relative to random selection. Key points of lift analysis:

6.1 Definition

\(Lift\) is a ratio: the rate at which the target behavior occurs with the strategy applied, divided by the rate at which it occurs without it.

公式為 \[ Lift = \frac {P(B|A)}{P(B)}\]

where \(P(B|A)\) is the probability of \(B\) given condition \(A\), and \(P(B)\) is the unconditional probability of \(B\).
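A worked example with illustrative numbers: if 10% of all customers churn, but 30% of the customers a model flags actually churn, the lift is 3.

```r
p_b_given_a <- 0.30  # churn rate among flagged customers (illustrative)
p_b         <- 0.10  # overall churn rate (illustrative)
lift <- p_b_given_a / p_b
lift  # 3: flagged customers churn three times as often as average
```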

6.2 Goal

\(Lift\) analysis helps determine whether a particular action or model has a positive effect on the outcome, and whether that effect significantly exceeds what random chance would produce.

6.3 Applications

  • Marketing campaigns: analyze the effect of a specific campaign on purchase behavior.
  • Recommender systems: evaluate whether a recommendation algorithm effectively raises users’ click-through or purchase rates.
  • Risk assessment: in finance, evaluate how effective a strategy is at reducing fraud.

6.4 Pros and Cons

  • Pros: intuitive, easy to understand and communicate; helps quickly identify the most effective strategies or customer segments.
  • Cons: ignores potential biases and external factors; a high \(Lift\) value does not necessarily mean a large absolute benefit, especially when the base rate \(P(B)\) is low.

(Credit: ChatGPT)

  1. Lift analysis of the prediction results
# sort by predicted churn probability, descending
test_results <- test_results[order(-test_results$probability),]

head(test_results, 10)
##      customerID label probability groundtruth
## 90   7665-TOALD     1   0.8760448           1
## 95   1447-GIQMR     1   0.8760448           1
## 160  1820-TQVEV     1   0.8760448           1
## 377  5134-IKDAY     1   0.8742824           1
## 92   5186-SAMNZ     1   0.8740061           1
## 35   6894-LFHLY     1   0.8727815           1
## 301  2868-MZAGQ     1   0.8720479           1
## 1216 0495-RVCBF     1   0.8720479           1
## 41   9300-AGZNL     1   0.8701169           1
## 413  7274-RTAPZ     1   0.8647000           1
# Segment the customers into probability deciles
test_results$decile <- cut(test_results$probability, breaks=quantile(test_results$probability, probs=seq(0, 1, by = 0.1)), include.lowest=TRUE, labels=FALSE)

# Reverse the decile numbering so decile 1 holds the highest probabilities
test_results$decile <- 11 - test_results$decile

head(test_results, 5)
##     customerID label probability groundtruth decile
## 90  7665-TOALD     1   0.8760448           1      1
## 95  1447-GIQMR     1   0.8760448           1      1
## 160 1820-TQVEV     1   0.8760448           1      1
## 377 5134-IKDAY     1   0.8742824           1      1
## 92  5186-SAMNZ     1   0.8740061           1      1
# Compute each decile's response rate and lift
test_lift_df <- test_results %>%
  group_by(decile) %>%
  summarise(
    count = n(),
    num_responses = sum(label),
    response_rate = mean(label),
    lift = response_rate / mean(data$Churn)
  )

plot <- ggplot(test_lift_df, aes(x = as.factor(decile), y = lift)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Lift Chart", x = "Decile", y = "Lift") +
  theme_minimal()

# save the chart with ggsave
ggsave("lift_chart.png", plot, width = 10, height = 6, dpi = 300)
img <- readPNG("lift_chart.png")
print(plot)

# Save the lift data to CSV
write.csv(test_lift_df, "lift_data.csv", row.names = FALSE)
head(test_lift_df, 10)
## # A tibble: 10 × 5
##    decile count num_responses response_rate  lift
##     <dbl> <int>         <dbl>         <dbl> <dbl>
##  1      1   141           141        1      3.77 
##  2      2   141           141        1      3.77 
##  3      3   141            13        0.0922 0.347
##  4      4   140             0        0      0    
##  5      5   141             0        0      0    
##  6      6   141             0        0      0    
##  7      7   140             0        0      0    
##  8      8   141             0        0      0    
##  9      9   125             0        0      0    
## 10     10   157             0        0      0